bpf/sockmap: add splice support for tcp_bpf #11277

Open
kernel-patches-daemon-bpf[bot] wants to merge 7 commits into bpf-next_base from series/1061046=>bpf-next

@kernel-patches-daemon-bpf

Pull request for series with
subject: bpf/sockmap: add splice support for tcp_bpf
version: 1
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1061046

@kernel-patches-daemon-bpf

Upstream branch: 05c9b2e
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1061046
version: 1

@kernel-patches-review-bot

commit fba0aa2abe5f462df7cd28193b8ea6a409b2881d
Author: Jiayuan Chen <jiayuan.chen@linux.dev>

tcp_bpf: refactor recvmsg with read actor abstraction

Refactors the tcp_bpf recvmsg paths to use a function pointer callback
(sk_msg_read_actor_t) instead of calling copy_page_to_iter() directly,
preparing for future splice support.

> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 2e26174c9919..6a906bfe3aa4 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c

[ ... ]

> @@ -440,7 +442,8 @@ int __sk_msg_recvmsg(struct sock *sk, ...
>  			if (copy)
> -				copy = copy_page_to_iter(page, sge->offset, copy, iter);
> +				copy = actor(actor_arg, page,
> +					     sge->offset, copy);

The direct copy_page_to_iter() call is now replaced by the actor
callback here, but deeper in the same function the peek-path comment
still references copy_page_to_iter() by name:

net/core/skmsg.c:sk_msg_read_core() {
    ...
		} else {
			/* Lets not optimize peek case if copy_page_to_iter
			 * didn't copy the entire length lets just break.
			 */
			if (copy != sge->length)
				goto out;
    ...
}

Should this comment say "the actor" instead of "copy_page_to_iter"
now that the copy operation is abstracted behind the callback? Once
the splice actor is added later in this series, the reference to
copy_page_to_iter will be inaccurate for that path.
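
As a rough illustration of why the comment's wording matters, the following userspace sketch models a read loop that is agnostic about its copy primitive. All names here (`read_actor_t`, `copy_actor`, `read_core`, `struct seg`) are hypothetical stand-ins, not the kernel's `sk_msg_read_actor_t` or `sk_msg_read_core()`, and the loop is heavily simplified.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for a read actor: consume up to len
 * bytes starting at src and return how many bytes were taken. */
typedef size_t (*read_actor_t)(void *arg, const char *src, size_t len);

struct seg {
	const char *data;
	size_t len;
};

struct copy_ctx {
	char *dst;
	size_t room;
};

/* Actor that copies into a flat buffer, playing the role that a
 * copy_page_to_iter()-based actor plays for recvmsg. */
static size_t copy_actor(void *arg, const char *src, size_t len)
{
	struct copy_ctx *ctx = arg;
	size_t n = len < ctx->room ? len : ctx->room;

	memcpy(ctx->dst, src, n);
	ctx->dst += n;
	ctx->room -= n;
	return n;
}

/* Core loop: it only sees "the actor", never the concrete copy routine. */
static size_t read_core(const struct seg *segs, size_t nsegs,
			read_actor_t actor, void *arg)
{
	size_t total = 0;

	for (size_t i = 0; i < nsegs; i++) {
		size_t n = actor(arg, segs[i].data, segs[i].len);

		total += n;
		if (n != segs[i].len)
			break;	/* the actor took less than the whole segment */
	}
	return total;
}
```

Because `read_core()` cannot tell which actor it was handed, a comment inside it that names one specific copy routine stops being accurate as soon as a second actor exists.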



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: tcp_bpf: refactor recvmsg with read actor abstraction
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/22658462977

@kernel-patches-review-bot

commit 0eb3c1a5d8b9657ba331b71bb22002d34d72bd00
Author: Jiayuan Chen <jiayuan.chen@linux.dev>

tcp_bpf: add splice_read support for sockmap

Adds splice_read support for sockmap by introducing sk_msg_splice_actor()
which copies pages into the pipe, and fixes sk_msg_read_core() to advance
sg.start when the actor fails mid-way through processing.

> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -445,6 +445,16 @@ int sk_msg_read_core(struct sock *sk, struct sk_psock *psock,
>  				copy = actor(actor_arg, page,
>  					     sge->offset, copy);
>  			if (!copy) {
> +				/*
> +				 * The loop processes msg_rx->sg entries
> +				 * sequentially and prior entries may
> +				 * already be consumed. Advance sg.start
> +				 * so the next call resumes at the correct
> +				 * entry, otherwise it would revisit
> +				 * zero-length entries and return -EFAULT.
> +				 */
> +				if (!peek)
> +					msg_rx->sg.start = i;
>  				copied = copied ? copied : -EFAULT;
>  				goto out;

The sg.start fix looks correct for the mid-way failure case.

When the splice actor (below) fails on the very first sg entry, copied
is still 0, so this returns -EFAULT to the caller. For the recvmsg
actor, -EFAULT is the right error because copy_page_to_iter() returning
0 means a genuine fault. But for the splice actor, 0 can mean the pipe
is full (add_to_pipe() returned -EAGAIN) or memory allocation failed.

Would it make sense for sk_msg_splice_actor() to propagate the actual
error, or for the callers to translate -EFAULT into the appropriate
errno for their context? Right now the splice path returns -EFAULT to
userspace for a pipe-full condition, where standard tcp_splice_read()
would return -EAGAIN:

tcp_bpf_splice_read()
  -> __tcp_bpf_recvmsg()
    -> sk_msg_read_core() with sk_msg_splice_actor
      -> add_to_pipe() returns -EAGAIN (pipe full)
      -> actor returns 0
      -> copied == 0, so copied = copied ? copied : -EFAULT yields -EFAULT
    returns -EFAULT
  returns -EFAULT to userspace

> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c

[ ... ]

> +static int sk_msg_splice_actor(void *arg, struct page *page,
> +			       unsigned int offset, size_t len)
> +{
> +	struct tcp_bpf_splice_ctx *ctx = arg;
> +	struct pipe_buffer buf = {
> +		.ops = &nosteal_pipe_buf_ops,
> +	};
> +	ssize_t ret;
> +
> +	buf.page = alloc_page(GFP_KERNEL);
> +	if (!buf.page)
> +		return 0;
> +
> +	memcpy(page_address(buf.page), page_address(page) + offset, len);
> +	buf.offset = 0;
> +	buf.len = len;
> +
> +	/*
> +	 * add_to_pipe() calls pipe_buf_release() on failure, which
> +	 * handles put_page() via nosteal_pipe_buf_ops, so no explicit
> +	 * cleanup is needed here.
> +	 */
> +	ret = add_to_pipe(ctx->pipe, &buf);
> +	if (ret <= 0)
> +		return 0;
> +	return ret;
> +}

Both the alloc_page failure and the add_to_pipe failure return 0 here.
Since sk_msg_read_core() converts actor-returning-0 into -EFAULT when
no bytes have been copied yet, this means userspace can see -EFAULT for
what is actually -ENOMEM (allocation failure) or -EAGAIN (pipe full).
The alloc_page case under memory pressure seems particularly confusing,
as -EFAULT gives no indication that backing off might help.
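
One possible shape for the first option raised above (the actor propagating its own error) is sketched below as a userspace model. It assumes a changed actor contract, where the actor returns bytes consumed or a negative errno; that contract, and every name in the block (`errno_actor_t`, `pipe_full_actor`, `accept_all_actor`, `read_core_model`), is hypothetical and not taken from the series.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical actor type: bytes consumed (> 0) or a negative errno. */
typedef ssize_t (*errno_actor_t)(void *arg, size_t len);

/* Models a splice actor whose pipe is already full. */
static ssize_t pipe_full_actor(void *arg, size_t len)
{
	(void)arg;
	(void)len;
	return -EAGAIN;
}

/* Models an actor that always consumes the whole entry. */
static ssize_t accept_all_actor(void *arg, size_t len)
{
	(void)arg;
	return (ssize_t)len;
}

/* propagate == 0 models the behaviour described in the review (any
 * failure with nothing copied becomes -EFAULT); propagate != 0 keeps
 * the actor's own errno instead. */
static ssize_t read_core_model(errno_actor_t actor, void *arg,
			       const size_t *lens, size_t n, int propagate)
{
	ssize_t copied = 0;

	for (size_t i = 0; i < n; i++) {
		ssize_t ret = actor(arg, lens[i]);

		if (ret <= 0) {
			if (copied)
				return copied;	/* partial progress wins */
			return (propagate && ret < 0) ? ret : -EFAULT;
		}
		copied += ret;
	}
	return copied;
}
```

With `propagate` set, a full pipe surfaces as `-EAGAIN` and an allocation failure could surface as `-ENOMEM`, while a plain 0 from the actor still maps to `-EFAULT`.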



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: tcp_bpf: add splice_read support for sockmap
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/22658462977

@kernel-patches-daemon-bpf

Upstream branch: 4faa189
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1061046
version: 1

@kernel-patches-daemon-bpf

Upstream branch: 748f9c6
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1061046
version: 1

@kernel-patches-daemon-bpf

Upstream branch: 6dd780f
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1061046
version: 1

@kernel-patches-daemon-bpf

Upstream branch: 099bded
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1061046
version: 1

@kernel-patches-daemon-bpf

Upstream branch: bd2e02e
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1061046
version: 1

@kernel-patches-daemon-bpf

Upstream branch: 0c55d48
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1061046
version: 1

mrpre added 7 commits March 10, 2026 18:11
Add a splice_read function pointer to struct proto between recvmsg and
splice_eof. Set it to tcp_splice_read in both tcp_prot and tcpv6_prot.

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>

…am_ops

Add inet_splice_read() which dispatches to sk->sk_prot->splice_read
via INDIRECT_CALL_1. Replace the direct tcp_splice_read reference in
inet_stream_ops and inet6_stream_ops with inet_splice_read.
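
The dispatch described here can be approximated in userspace as follows. The macro is a simplified model of the kernel's `INDIRECT_CALL_1` (the in-kernel definition is configuration dependent and may reduce to a plain indirect call), and `proto_model` plus the `*_model` functions are hypothetical placeholders with invented signatures. `__builtin_expect` assumes GCC or Clang.

```c
#include <stddef.h>

/* Simplified model: compare against the expected target so the common
 * case becomes a direct call, otherwise fall back to the pointer. */
#define INDIRECT_CALL_1(f, f1, ...) \
	(__builtin_expect((f) == (f1), 1) ? f1(__VA_ARGS__) : (f)(__VA_ARGS__))

/* Hypothetical stand-in for the splice_read slot in struct proto. */
struct proto_model {
	long (*splice_read)(size_t len);
};

/* Placeholder for the default TCP handler. */
static long tcp_splice_read_model(size_t len)
{
	return (long)len;
}

/* Placeholder for a handler installed by another protocol variant. */
static long bpf_splice_read_model(size_t len)
{
	return (long)len / 2;
}

/* Models the dispatcher: direct call when the slot holds the default
 * handler, ordinary indirect call otherwise. */
static long inet_splice_read_model(const struct proto_model *prot, size_t len)
{
	return INDIRECT_CALL_1(prot->splice_read, tcp_splice_read_model, len);
}
```

The point of the pattern is that the common case avoids an indirect branch, while a replaced `splice_read` pointer is still honoured.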

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>

Refactor the read operation with no functional changes.

tcp_bpf has two read paths: strparser and non-strparser. Currently
the differences are implemented directly in their respective recvmsg
functions, which works fine. However, upcoming splice support would
require duplicating the same logic for both paths. To avoid this,
extract the strparser-specific differences into an independent
abstraction that can be reused by splice.

For ingress_msg data processing, introduce a function pointer
callback approach. The current implementation passes
sk_msg_recvmsg_actor(), which performs copy_page_to_iter() - the
same copy logic previously embedded in sk_msg_recvmsg(). This
provides the extension point for future splice support, where a
different actor can be plugged in.

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>

Implement splice_read for sockmap using an always-copy approach.
Each page from the psock ingress scatterlist is copied to a newly
allocated page before being added to the pipe, avoiding lifetime
and slab-page issues.

Add sk_msg_splice_actor() which allocates a fresh page via
alloc_page(), copies the data with memcpy(), then passes it to
add_to_pipe(). The newly allocated page already has a refcount
of 1, so no additional get_page() is needed. On add_to_pipe()
failure, no explicit cleanup is needed since add_to_pipe()
internally calls pipe_buf_release().

Also fix sk_msg_read_core() to update msg_rx->sg.start when the
actor returns 0 mid-way through processing. The loop processes
msg_rx->sg entries sequentially — if the actor fails (e.g. pipe
full for splice, or user buffer fault for recvmsg), prior entries
may already be consumed with sge->length set to 0. Without
advancing sg.start, subsequent calls would revisit these
zero-length entries and return -EFAULT. This is especially
common with the splice actor since the pipe has a small fixed
capacity (16 slots), but theoretically affects recvmsg as well.
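
The failure mode can be reproduced in a small userspace model. Everything below is a simplified sketch: `msg_model`, `slot_actor` and `resume_read_model` are hypothetical names, the scatterlist is a flat array, and the actor either takes a whole entry or fails, loosely imitating a pipe with a limited number of free slots.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define NENT 4

struct msg_model {
	size_t len[NENT];	/* remaining bytes per entry */
	size_t start;		/* first entry the next call looks at */
};

/* Takes a whole entry while slots remain, otherwise returns 0. */
static size_t slot_actor(void *arg, size_t len)
{
	unsigned int *slots = arg;

	if (*slots == 0)
		return 0;
	(*slots)--;
	return len;
}

static long resume_read_model(struct msg_model *m, unsigned int *slots,
			      bool advance_on_fail)
{
	long copied = 0;
	size_t i;

	for (i = m->start; i < NENT; i++) {
		size_t copy = m->len[i];

		if (copy)
			copy = slot_actor(slots, copy);
		if (!copy) {
			if (advance_on_fail)
				m->start = i;	/* the behaviour the fix adds */
			return copied ? copied : -EFAULT;
		}
		copied += (long)copy;
		m->len[i] = 0;		/* entry fully consumed */
	}
	m->start = i;
	return copied;
}
```

With entries of 10, 20, 30 and 40 bytes and two free slots per call, the first call returns 30 either way. Without the advance, the second call starts at entry 0, finds it empty and returns `-EFAULT`; with the advance it resumes at entry 2 and returns 70.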

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>

The previous splice_read implementation copies all data through
intermediate pages (alloc_page + memcpy). This is wasteful for
skb fragment pages which are allocated from the page allocator
and can be safely referenced via get_page().

Optimize by checking PageSlab() to distinguish between linear
skb data (slab-backed) and fragment pages (page allocator-backed):

- For slab pages (skb linear data): copy to a page fragment via
  sk_page_frag, matching what linear_to_page() does in the
  standard TCP splice path (skb_splice_bits). get_page() is
  invalid on slab pages so a copy is unavoidable here.
- For non-slab pages (skb frags): use get_page() directly for
  true zero-copy, same as skb_splice_bits does for fragments.

Both paths use nosteal_pipe_buf_ops. The sk_page_frag approach
is more memory-efficient than alloc_page for small linear copies,
as multiple copies can share a single page fragment.
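
The two cases can be pictured with a toy userspace model. This is only a sketch of the decision, not of the kernel code: `page_model` and `page_for_pipe` are hypothetical, the `slab` flag stands in for the `PageSlab()` check, `refcount++` stands in for `get_page()`, and `malloc()` stands in for the `sk_page_frag` copy.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct page_model {
	bool slab;	/* stands in for PageSlab() */
	int refcount;
	char data[64];
};

/* Returns the page to hand to the pipe: the source page itself with one
 * extra reference when it is not slab-backed, or a private copy when it
 * is. Returns NULL if the copy cannot be allocated. */
static struct page_model *page_for_pipe(struct page_model *src)
{
	struct page_model *copy;

	if (!src->slab) {
		src->refcount++;	/* zero-copy: share the page */
		return src;
	}

	copy = malloc(sizeof(*copy));
	if (!copy)
		return NULL;
	memcpy(copy->data, src->data, sizeof(copy->data));
	copy->slab = false;
	copy->refcount = 1;
	return copy;
}
```

The slab-backed source is never referenced by the pipe in this model, which mirrors the statement above that taking a reference on slab pages is invalid and a copy is unavoidable.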

Benchmark results with rx-verdict-ingress mode (loopback, 8 CPUs):

  splice(2) + always-copy:  ~2770 MB/s (before this patch)
  splice(2) + zero-copy:    ~4270 MB/s (after this patch, +54%)
  read(2):                  ~4292 MB/s (baseline for reference)

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>

Add splice_read coverage to sockmap_basic and sockmap_strp selftests.
Each test suite now runs twice: once with normal recv_timeout() and
once with splice-based reads, verifying that data read via splice(2)
through a pipe produces identical results.

A recv_timeout_with_splice() helper is added to sockmap_helpers.h
that creates a temporary pipe, splices data from the socket into
the pipe, then reads from the pipe into the user buffer. MSG_PEEK
calls fall back to native recv since splice does not support peek.
Non-TCP sockets also fall back to native recv.

The splice subtests are distinguished by appending " splice" to
each subtest name via a test__start_subtest macro override.

./test_progs -a sockmap_*
...
Summary: 5/830 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>

Add --splice option to bench_sockmap that uses splice(2) instead of
read(2) in the consumer path. A global pipe is created once during
setup and reused across iterations to avoid per-call pipe creation
overhead.

When --splice is enabled, the consumer splices data from the socket
into the pipe, then reads from the pipe into the user buffer. The
socket is set to O_NONBLOCK to prevent tcp_splice_read() from
blocking indefinitely, as it only checks sock->file->f_flags for
non-blocking mode, ignoring SPLICE_F_NONBLOCK.

Also increase SO_RCVBUF to 16MB to avoid sk_psock_backlog being
throttled by the default sk_rcvbuf limit, and add --verify option
to optionally enable data correctness checking (disabled by default
for benchmark accuracy).
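
The socket setup described in the two paragraphs above might look roughly like this in userspace. The helper name `prepare_consumer_socket` is hypothetical and the benchmark's actual code may differ; only standard `fcntl(2)` and `setsockopt(2)` calls are used.

```c
#include <fcntl.h>
#include <sys/socket.h>

/* Put fd into non-blocking mode and request a receive buffer of
 * rcvbuf_bytes. Returns 0 on success, -1 on failure. */
static int prepare_consumer_socket(int fd, int rcvbuf_bytes)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags < 0)
		return -1;
	/* O_NONBLOCK on the file is what tcp_splice_read() consults. */
	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
		return -1;

	/* The kernel caps this request at net.core.rmem_max, which is why
	 * the usage notes raise that sysctl first. */
	return setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
			  &rcvbuf_bytes, sizeof(rcvbuf_bytes));
}
```

A call such as `prepare_consumer_socket(fd, 16 * 1024 * 1024)` would request the 16MB buffer mentioned above; the size actually granted can be read back with `getsockopt()`.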

Benchmark results with rx-verdict-ingress mode (loopback, 8 CPUs):

  read(2):                  ~4292 MB/s
  splice(2) + zero-copy:    ~4270 MB/s
  splice(2) + always-copy:  ~2770 MB/s

Zero-copy splice achieves near-parity with read(2), while the
always-copy fallback is ~35% slower.
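
The quoted percentages follow directly from the throughput figures; a small helper (the name `pct_change` is just for illustration) makes the arithmetic explicit.

```c
/* Relative change from base to value, in percent. */
static double pct_change(double base, double value)
{
	return (value - base) / base * 100.0;
}
```

`pct_change(2770, 4270)` is about 54.2, matching the +54% gain of zero-copy over always-copy, and `pct_change(4292, 2770)` is about -35.5, matching the "~35% slower" figure relative to read(2). Zero-copy splice versus read(2) works out to roughly -0.5%.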

Usage:
  # Steer softirqs to CPU 7 to avoid contending with the producer CPU
  echo 80 > /sys/class/net/lo/queues/rx-0/rps_cpus
  # Raise the receive buffer ceiling so the benchmark can set 16MB rcvbuf
  sysctl -w net.core.rmem_max=16777216
  # Run the benchmark
  ./bench sockmap --rx-verdict-ingress --splice -c 2 -p 1 -a -d 30

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>

@kernel-patches-daemon-bpf

Upstream branch: e95e85b
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1061046
version: 1
